
Perbandingan Feature Kata dan Frasa dalam Kinerja Clustering Dokumen Teks Berbahasa Indonesia

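Machine translation: Comparison of Word and Phrase Features in the Clustering Performance of Indonesian-Language Text Documents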

Abstract

Text document clustering has been studied intensively because of its important role in text mining and information retrieval. Word-based clustering techniques that use the vector space model consistently suffer from high dimensionality caused by the large number of words. Although extracting words in the preprocessing phase is simple, a collection can be viewed not only as a set of words but also as a set that partly consists of phrases of more than one word. Splitting a phrase into its parts can destroy the phrase's actual meaning, so to preserve the context of its words a phrase must be kept as a phrase. The working assumption is that adding phrases to words as clustering features will improve performance. This paper compares word-based and phrase-based clustering. Three clustering models were chosen: hierarchical, partitional, and hybrid. Four similarity techniques (GroupAverage, CompleteLink, SingleLink, and ClusterCenter) were tried for the hierarchical model, K-Means and Bisecting K-Means for the partitional model, and Buckshot for the hybrid model. Document collections of 200-800 manually categorized news texts were used to test these algorithms, with F-measure as the criterion of clustering performance. This value is derived from recall and precision and measures how well an algorithm assigns the documents in a collection to their correct categories. The results show that adding phrases, or simply word pairs, slightly improves clustering performance, although the improvement is not statistically significant.
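As a rough illustration of the feature sets being compared, the sketch below builds a bag of words and a bag of words plus adjacent word pairs for one document. The abstract does not describe how phrases were actually extracted, so treating every adjacent word pair as a candidate phrase is an assumption, as are the function name `word_and_pair_features`, the simple regex tokenizer, and the absence of stemming or stop-word removal.

```python
import re
from collections import Counter


def word_and_pair_features(text, include_pairs=True):
    """Count single words and, optionally, adjacent word pairs in a text.

    Hypothetical helper: the paper's real preprocessing is not given in
    the abstract, so this only shows the general idea.
    """
    # Lowercase and keep alphabetic runs only (a deliberately naive tokenizer).
    tokens = re.findall(r"[a-z]+", text.lower())
    features = Counter(tokens)
    if include_pairs:
        # Join each adjacent pair with "_" so it stays one feature.
        pairs = (f"{a}_{b}" for a, b in zip(tokens, tokens[1:]))
        features.update(pairs)
    return features


if __name__ == "__main__":
    doc = "rumah sakit umum"
    print(word_and_pair_features(doc, include_pairs=False))
    print(word_and_pair_features(doc))
```

With pairs enabled, "rumah sakit" (hospital) survives as the single feature `rumah_sakit` instead of only the separate words "rumah" (house) and "sakit" (sick), which is the loss of meaning the abstract refers to. The cost is a larger feature space, which adds to the dimensionality problem also mentioned there.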
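The abstract says only that the F-measure is derived from recall and precision. One formulation commonly used for document clustering, assumed here and not confirmed by the abstract, scores every (category, cluster) pair, takes the best-matching cluster for each manual category, and weights that score by category size. The function name `clustering_f_measure` and the toy labels are illustrative.

```python
from collections import Counter


def clustering_f_measure(true_labels, cluster_ids):
    """Overall F-measure of a clustering against manual categories.

    For category i and cluster j with overlap n_ij:
        recall = n_ij / n_i, precision = n_ij / n_j,
        F(i, j) = 2 * precision * recall / (precision + recall).
    The overall score is the size-weighted sum of max_j F(i, j).
    This is an assumed formulation, not necessarily the paper's.
    """
    if len(true_labels) != len(cluster_ids) or not true_labels:
        raise ValueError("inputs must be non-empty and of equal length")

    n = len(true_labels)
    class_sizes = Counter(true_labels)
    cluster_sizes = Counter(cluster_ids)
    overlap = Counter(zip(true_labels, cluster_ids))

    total = 0.0
    for cls, n_i in class_sizes.items():
        best = 0.0
        for clu, n_j in cluster_sizes.items():
            n_ij = overlap[(cls, clu)]
            if n_ij == 0:
                continue  # no shared documents, so F(i, j) is 0
            recall = n_ij / n_i
            precision = n_ij / n_j
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_i / n) * best
    return total


if __name__ == "__main__":
    categories = ["sport", "sport", "politics", "politics"]
    print(clustering_f_measure(categories, [0, 0, 1, 1]))  # perfect split
    print(clustering_f_measure(categories, [0, 0, 0, 0]))  # one big cluster
```

A clustering that reproduces the manual categories exactly scores 1.0, while lumping everything into one cluster is penalized through precision.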
